Lecture #14 
- Hash Tables 
- The Modulus Operator 
- Closed hash tables 
- Open hash tables 
- Hash table efficiency and "load factor"
- Hashing non-numeric values 


- unordered_map: A hash-based STL map class
- (Database) Tables


Big-OH Craziness 


Consider a binary search tree that holds N student records, all 
indexed by their name. 


Each student record contains a linked-list of the L classes that 
they have taken while at UCLA. 


What is the big-oh to determine if a 
student has taken a class? 


Hash Tables 
Why should you care? 


Hash tables are often THE most 
efficient way to search for data! 


You can search a hash table with 
billions of items in just microseconds! 


They're used in search engines,
antivirus scanners, navigation systems, 
social network sites, etc. 


And because you'll be asked about 
them in job interviews and on exams. 


So pay attention! 


The Modulus Operator 


In C++, the % operator is used to divide two numbers 
and obtain the remainder. 


For example, if we compute:
    int x = 1234 % 100;
the value of x will be 34 (1234 divided by 100 is 12 R 34).

Now, as it turns out, the modulo
operator has an interesting
property!


Let's see if you can 
figure out what it is... 


The Modulus Operator

Let's modulus-divide a bunch of numbers
by 5 and see what the results are!

0 % 5 = 0
1 % 5 = 1
2 % 5 = 2
3 % 5 = 3
4 % 5 = 4
5 % 5 = 0
6 % 5 = 1
7 % 5 = 2
8 % 5 = 3
9 % 5 = 4
10 % 5 = 0
11 % 5 = 1


What do you notice? 


When we divide numbers by 5, all of the 
remainders are less than 5 (between 0-4)! 


Let's try again with 3 for fun! 


When we divide numbers by 3, all of the 
remainders are less than 3 (between 0-2)! 


And as you'd guess, if you divided a bunch of 
numbers by 100,000, the remainders would all 
be less than 100,000 (between 0-99,999)! 


Rule: When you divide by a given value 
N, all of your remainders are guaranteed 
to be between 0 and N-1!
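
The rule above can be checked with a few lines of C++. This is only a sketch: the helper names remaindersInRange and remainders are made up for illustration, and only non-negative inputs are considered (in C++, % applied to a negative left operand can produce a negative result).

```cpp
#include <vector>

// Hypothetical helper: returns true if every value in [0, count) leaves a
// remainder between 0 and n-1 when modulus-divided by n (n must be > 0).
bool remaindersInRange(int count, int n)
{
    for (int value = 0; value < count; value++)
    {
        int r = value % n;
        if (r < 0 || r > n - 1)
            return false;
    }
    return true;
}

// Hypothetical helper: collects value % n for value = 0, 1, ..., count-1 so
// the repeating pattern (0, 1, ..., n-1, 0, 1, ...) is easy to see.
std::vector<int> remainders(int count, int n)
{
    std::vector<int> result;
    for (int value = 0; value < count; value++)
        result.push_back(value % n);
    return result;
}
```

Calling remainders(12, 5) reproduces the list of results for 0 % 5 through 11 % 5.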


The “Hash Table” 


OK... So far, what's the most efficient ADT 
we know of to insert and search for data? 


Right! The (balanced) Binary Search Tree - it gives 
us O(logN) performance! 


Can we do any better? If so, how much better? 


Challenge: 


Build an ADT that holds a bunch of 9-digit student ID#s 
such that the user can add new ID#s or 
determine if the ADT holds an existing ID# 
in just 1 step - not O(N) or O(logN) but O(1). 


The (Almost) Hash Table 


How can we create an ADT where we can insert the 
9-digit student ID#s for all 50,000 UCLA students... 


and then find if our ADT holds a given ID# 
in just one algorithmic step?!?!? 


That can't be done... can it? 


It can, and let's see how! 


Let's use a really, really large array to hold our #s. 


The (Almost) Hash Table 


class AlmostHashTable
{
public:
    void addItem(int n)
    {
        m_array[n] = true;
    }
    bool holdsItem(int q)
    {
        return m_array[q] == true;
    }
private:
    bool m_array[1000000000];  // big! one slot per possible 9-digit ID#
};

int main()
{
    AlmostHashTable x;

    x.addItem(400683948);
    if (x.holdsItem(1234) != true)
        cout << "Couldn't find it!";
}

Idea: 
Let's create an array with 
1 billion slots - one slot for 
each valid ID#. 


To add a new ID# with a
value of N, we'll simply set
array[N] to true.

To determine if our array
holds a previously-added
value Q, simply check if
array[Q] is true.

[Diagram: m_array, indexed 000,000,000 through 999,999,999, with slot 400,683,948 set to TRUE.]

The (Almost) Hash Table 


OK - so now we know how to build an O(1) search!
But what's the problem with our ADT? 


It's really, really inefficient: 
Our array has 1 billion slots 

yet there are only 50,000 UCLA student IDs 
we could possibly add to it, 

so we're wasting 999,950,000 of the slots...


It would be great if we could use the same algorithm 
but with a smaller array, say one with 100,000 slots 
instead of 1 billion! 


The (Almost) Hash Table 


Let's say we want to keep track of our 50,000 ID#s
in an array with just 100,000 slots. 


If we just try to use our 9-digit number to index the array, 
there won't be room! 


What we need is some cool mathematical function that takes in a
9-digit ID# and somehow converts it to a unique slot number
between 0 and 99,999 in the array!

[Diagram: a mapping function f(x) takes ID#s (range: 0-999,999,999), such as 024,641,083, 605,172,432 and 723,992,279, and maps each one to a slot # (range: 0-99,999).]
The (Almost) Hash Table

class AlmostHashTable2
{
public:
    void addItem(int n)
    {
        int slot = mapFunc(n);
        m_array[slot] = true;
    }
    bool containsItem(int q)
    {
        int slot = mapFunc(q);
        return m_array[slot] == true;
    }

private:
    int mapFunc(int idNum)
    { /* ... */ }

    bool m_array[100000];  // not so big!
};


This is the "almost" hash table, v2. It's still NOT a 


valid hash table, but it's getting closer! 


Assuming we can come up with such a
mapping function (mapFunc), we can use a
(small) 100,000 element array to hold our
data...

For a given student ID n, we compute
slot = mapFunc(n) to get its slot number in
the array.

We then set the slot to TRUE to indicate
that a student with that ID is in our table.

By the way, the official CS lingo for a
"slot" in the array is a "bucket." So that's
what we'll call our slots from now on!

OK, so what does mapFunc() look like?
The Mapping Function

How can we write a mapFunc that converts our large ID#
into a bucket # that falls within our 100,000 element array?

int mapFunc(int idNum)
{
    const int ARRAY_SIZE = 100000;

    int bucket = idNum % ARRAY_SIZE;
    return bucket;
}

The line that computes bucket takes an input value idNum and returns
an output value between 0 and ARRAY_SIZE - 1 (0 to 99,999).

And this corresponding value can be used to pick a bucket in
our 100,000 element array!

RIGHT! The C++ % operator (aka the modulus division operator)
does exactly what we want!!!

So now for each input ID# we can
compute a corresponding value
between 0-99,999!


The (Almost) Hash Table

Let's see how it works.

class AlmostHashTable2
{
public:
    void addItem(int n)
    {
        int bucket = mapFunc(n);
        m_array[bucket] = true;
    }

private:
    int mapFunc(int idNum)
    {
        return idNum % 100000;
    }

    bool m_array[100000];  // not so big!
};

int main()
{
    AlmostHashTable2 x;
    x.addItem(400683948);
    x.addItem(111105224);
    x.addItem(222205224);
}

x.addItem(400683948): 400,683,948 % 100,000 = 83,948
The true value in slot 83,948 indicates that
the value 400,683,948 is held in our ADT.

x.addItem(111105224): 111,105,224 % 100,000 = 5,224
The true value in slot 5,224 indicates that
the value 111,105,224 is held in our ADT.

The (Almost) Hash Table

OK, let's try to add the last ID# (222205224) to
our almost hash table...

Like 111,105,224, mapFunc() computes a slot for it
of 5,224 (222,205,224 % 100,000 == 5,224).

But wait! We already stored a true value in bucket
5,224 to represent 111,105,224!

Now things are ambiguous! How can we tell if our
hash table holds 222,205,224 or 111,105,224?

This is called a "collision" and it's a real problem!

The (Almost) Hash Table: A problem

A collision is a condition where two or
more values both map to the same bucket in
the array.

This causes ambiguity, and we can't tell what
value was actually stored in the array!

Let's see how to fix this problem!


REAL Hash Tables 


There are many schemes for dealing with collisions, 
and today we'll learn two of the most popular... 


The Closed Hash Table with "Linear Probing"
The "Open Hash Table"
Closed Hash Table with Linear Probing: Insertion

Linear Probing Insertion:

As before, we use our mapping function to locate
the right bucket in our array.

If the target bucket is empty, we can store
our value there.

However, instead of storing true in the bucket,
we store our full original value - this prevents
ambiguity.

If the bucket is occupied, we scan down from
that bucket until we hit the first empty bucket.

We put the new value there.

So first we'd add 111,105,224 by computing its slot number, 5,224.
We'd find that this slot is empty, and then we'd stick the value
111,105,224 in that slot.
Then we'd compute the slot for 222,205,224 and get 5,224. We'd find
that slot 5,224 is already occupied. So we start "probing" down until
we find an empty slot (e.g., look in slot 5,225, then 5,226, etc.,
wrapping around at slot 99,999 back to zero). The moment we find an
empty slot, we place our value (222,205,224) in that slot.

In this case, slot 5,225 is unoccupied so we place our value
222,205,224 in that slot.
Closed Hash Table with Linear Probing: Insertion

Linear Probing Insertion:

Sometimes, you'll need to insert an item near the end
of the table...

For instance, let's say we want to insert a new value of
640,099,998 into our hash table. Notice that this would
normally go into slot 99,998. But that slot is already filled
with 475,699,998.

So we start probing downward. We look at slot 99,999
and see that it's already filled too (with 100,399,999)!
So we want to keep probing down to find an empty slot.

But if we do, we'll go past the end of the array. What do
we do?

Well, if you run into a collision on the last
bucket, and go past the end...
You simply wrap back around the top to slot zero!
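
The wrap-around step is just one more use of the % operator. A minimal sketch, assuming a helper named nextBucket (a made-up name, not from the slides):

```cpp
// Hypothetical helper: the bucket to probe after `bucket` in a table with
// numBuckets slots. The % makes the last slot wrap around to slot 0.
int nextBucket(int bucket, int numBuckets)
{
    return (bucket + 1) % numBuckets;
}
```

For the value 640,099,998 in a 100,000-bucket table, the probe sequence is 99,998, then nextBucket(99998, 100000) which is 99,999, then nextBucket(99999, 100000) which is 0.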
Closed Hash Table with Linear Probing: Searching

Linear Probing Searching:

To search our hash table, we use a similar
approach.

We compute a target bucket number
with our mapping function.

We then look in that bucket for our value.

If we find it, great!

If we don't find our value, we probe
linearly down the array until we either find
our value or hit an empty bucket.

If while probing, you run into an empty bucket,
it means: your value isn't in the array.

As before, if we go past the end of the array, we
just wrap around back to slot zero and continue
searching.
Closed Hash Table with Linear Probing

This approach addresses
collisions by putting each
value as close as possible to
its intended bucket.

Since we store every original
value (e.g., 111,105,224) in
the array, there is no chance
of ambiguity.
Closed Hash Table with Linear Probing

So why do we call this a
"Closed" hash table???

Since our data is stored in a
fixed-size array, there are a
fixed (closed) number of buckets
for us to put values.

Once we run out of empty buckets,
we can't add new values...

Linked lists and binary search trees
don't have this problem!

OK, let's see the C++ code now!


Linear Probing Hash Table: The Details 


In a Linear Probing Hash Table, each bucket in the array
is just a C++ struct.
Each bucket holds two items:

1. A variable to hold your value (e.g., an int for an ID#)

2. A "used" field that indicates if this bucket in the
hash table has been filled or not.

struct BUCKET
{
    // a bucket stores a value (e.g. an ID#)
    int idNum;
    bool used;  // is bucket in-use?
};

If the used field is false, it means
that this Bucket in the array is
empty. If the field is true,
then it means this Bucket is
already filled with valid data.

const int NUM_BUCK = 10;

class HashTable
{
public:
    void insert(int idNum)
    {
        int bucket = mapFunc(idNum);

        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (m_buckets[bucket].used == false)
            {
                m_buckets[bucket].idNum = idNum;
                m_buckets[bucket].used = true;
                return;
            }
            bucket = (bucket + 1) % NUM_BUCK;
        }
        // no room left in hash table!!!
    }

private:
    int mapFunc(int idNum) const
    { return idNum % NUM_BUCK; }

    BUCKET m_buckets[NUM_BUCK];
};

Our hash table has 10 slots, aka "buckets."

First we compute the starting bucket number.

Since our array has 10 slots, we will loop up to 10 times looking
for an empty space. If we don't find an empty space after 10
tries, our table is full!

We'll store our new item in the first unused bucket that
we find, starting with the bucket selected by our
mapping function.

If the current bucket is already occupied by an item,
advance to the next bucket (wrapping around from slot 9
back to slot 0 when we hit the end).

Here's our mapping function.
As before, we compute our bucket
number by dividing the ID number by
the total # of buckets and then taking
the remainder (%).

Linear Probing: Inserting

int main()
{
    HashTable ht;

    ht.insert(29);
    ht.insert(65);
    ht.insert(79);
}

ht.insert(29): the starting bucket is 29 % 10 = 9. Bucket 9 is unused,
so 29 is stored there and the bucket is marked used.

ht.insert(65): the starting bucket is 65 % 10 = 5. Bucket 5 is unused,
so 65 is stored there and the bucket is marked used.

ht.insert(79): the starting bucket is 79 % 10 = 9, but bucket 9 already
holds 29. We advance to (9 + 1) % 10 = 0. Bucket 0 is unused, so 79 is
stored there.
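
The three insertions can be reproduced with the sketch below. It follows the slide's insert() closely, with two assumptions added for illustration: the buckets use default member initializers so every bucket starts out unused, and insert() returns the bucket it ended up using (or -1 when the table is full) instead of returning void, so the probing can be observed.

```cpp
const int NUM_BUCK = 10;

struct Bucket
{
    int  idNum = 0;
    bool used  = false;   // every bucket starts out empty
};

class HashTable
{
public:
    // Returns the bucket the value landed in, or -1 if the table is full.
    int insert(int idNum)
    {
        int bucket = mapFunc(idNum);
        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (!m_buckets[bucket].used)
            {
                m_buckets[bucket].idNum = idNum;
                m_buckets[bucket].used = true;
                return bucket;
            }
            bucket = (bucket + 1) % NUM_BUCK;   // wrap from slot 9 to slot 0
        }
        return -1;  // no room left in hash table
    }

private:
    int mapFunc(int idNum) const { return idNum % NUM_BUCK; }

    Bucket m_buckets[NUM_BUCK];
};
```

With this version, inserting 29, 65 and 79 in that order reports buckets 9, 5 and 0.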


const int NUM_BUCK = 10;

class HashTable
{
public:
    bool search(int idNum)
    {
        int bucket = mapFunc(idNum);

        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (m_buckets[bucket].used == false)
                return false;
            if (m_buckets[bucket].idNum == idNum)
                return true;

            bucket = (bucket + 1) % NUM_BUCK;
        }
        return false;  // not in the hash table
    }

private:
    int mapFunc(int idNum) const
    { return idNum % NUM_BUCK; }

    BUCKET m_buckets[NUM_BUCK];
};

Compute the starting bucket where we expect
to find our item.

Since we may have collisions, in the worst
case, we may need to check the entire table!
(10 slots)

If we reach an empty bucket (and haven't yet
found our item) then we know our item is not in the
table!

Otherwise the bucket is in-use. If it also
holds our ID# then we've found our item and we're
done.

If we didn't find our item,
advance to the next bucket in
search of it.
Wrap around when we reach
the end of the array.

If we went through every
bucket and didn't find our item,
then it's not in the hash table!
Tell the user.


Linear Probing: Searching

bucket  idNum  used
0       79     T
1              f
2              f
3              f
4              f
5       65     T
6       15     T
7       175    T
8              f
9       29     T

int main()
{
    HashTable ht;
    ...
    bool x;
    x = ht.search(29);
    x = ht.search(175);
    x = ht.search(20);
}

ht.search(29): the starting bucket is 29 % 10 = 9. Bucket 9 is in use and
holds 29, so we return true.

ht.search(175): the starting bucket is 175 % 10 = 5. Bucket 5 holds 65 and
bucket 6 holds 15, so we keep probing. Bucket 7 holds 175, so we return true.

ht.search(20): the starting bucket is 20 % 10 = 0. Bucket 0 holds 79, so we
advance to bucket 1. Bucket 1 is empty, so 20 is not in the table and we
return false.


What Can you Store in your Hash Table?

Oh, and if you like, you can include
additional associated values
(e.g., a name, GPA) in each bucket!

struct Bucket
{
    int idNum;
    string name;
    float GPA;
    bool used;
};

For instance, what if
I want to also store
the student's name
and GPA in each
bucket along with
their ID#?

You can do that!

bool search(int idNum, string &name, float &GPA)
{
    int bucket = mapFunc(idNum);

    for (int tries = 0; tries < NUM_BUCK; tries++)
    {
        if (m_buckets[bucket].used == false)
            return false;
        if (m_buckets[bucket].idNum == idNum)
        {
            name = m_buckets[bucket].name;
            GPA = m_buckets[bucket].GPA;
            return true;
        }
        bucket = (bucket + 1) % NUM_BUCK;
    }

    return false;  // not in the hash table
}

Now when you look up a student by ID#, search() also hands back
the name and GPA stored in that student's bucket.

Linear Probing: Deleting?

So far, we've seen how to insert items into our Linear Probe hash table.
What if we want to delete a value from our hash table?
Let's take a naive approach and see what happens...

To delete a value, let's just zero out our value and set the
used field to false... For instance, let's delete 65.

bucket  before deleting 65     after deleting 65
0       idNum: 79   used: T    idNum: 79   used: T
1       idNum:      used: f    idNum:      used: f
2       idNum:      used: f    idNum:      used: f
3       idNum:      used: f    idNum:      used: f
4       idNum:      used: f    idNum:      used: f
5       idNum: 65   used: T    idNum:      used: f
6       idNum: 15   used: T    idNum: 15   used: T
7       idNum: 175  used: T    idNum: 175  used: T
8       idNum:      used: f    idNum:      used: f
9       idNum: 29   used: T    idNum: 29   used: T

To delete the value of 65 in slot #5, we zeroed out idNum
and set used to false.

If we delete a value where a collision happened...

When we try to search again, we may prematurely abort our search, failing to find the sought-for value.
For example, if we search for 15 (in the "after" table), our algorithm will go to slot #5. We'd find
that slot #5 is empty, and we'd abort our search. Eek!

So, as you can see, if we simply delete an item from our hash table in a naive way, we have problems!
There are ways to solve this problem with a Linear Probing hash table, but they're not recommended! 


So, in summary, only use Closed/Linear Probing hash tables 
when you don't intend to delete items from your hash table. 


Like if you're building a hash table that holds words for a dictionary...You'll just add words, never delete any, right? 
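
The failure described above is easy to reproduce. The sketch below assumes the same 10-bucket table as the slides; the naiveErase member and the default-initialized Bucket struct are additions made for this illustration, not code from the lecture.

```cpp
const int NUM_BUCK = 10;

struct Bucket
{
    int  idNum = 0;
    bool used  = false;
};

class HashTable
{
public:
    void insert(int idNum)
    {
        int bucket = mapFunc(idNum);
        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (!m_buckets[bucket].used)
            {
                m_buckets[bucket].idNum = idNum;
                m_buckets[bucket].used = true;
                return;
            }
            bucket = (bucket + 1) % NUM_BUCK;
        }
    }

    bool search(int idNum) const
    {
        int bucket = mapFunc(idNum);
        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (!m_buckets[bucket].used)
                return false;               // empty bucket: stop probing
            if (m_buckets[bucket].idNum == idNum)
                return true;
            bucket = (bucket + 1) % NUM_BUCK;
        }
        return false;
    }

    // The naive deletion: zero out the value and clear the used flag.
    void naiveErase(int idNum)
    {
        int bucket = mapFunc(idNum);
        for (int tries = 0; tries < NUM_BUCK; tries++)
        {
            if (!m_buckets[bucket].used)
                return;                     // value was never in the table
            if (m_buckets[bucket].idNum == idNum)
            {
                m_buckets[bucket].idNum = 0;
                m_buckets[bucket].used = false;
                return;
            }
            bucket = (bucket + 1) % NUM_BUCK;
        }
    }

private:
    int mapFunc(int idNum) const { return idNum % NUM_BUCK; }
    Bucket m_buckets[NUM_BUCK];
};
```

After inserting 29, 65, 79, 15 and 175 and then calling naiveErase(65), search(15) and search(175) both return false even though those values are still stored in buckets 6 and 7: the probe stops at the now-empty bucket 5.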


The "Open Hash Table"


We just saw how to use linear probing to deal with 
collisions in our closed hash table. 


Our closed hash table + linear probing works just fine, but 
it still has a few problems: 
It's difficult to delete items 


It has a cap on the number 
of items it can hold... That's a bummer. 


It'd be nice if we could find a way to avoid both of 
these problems, yet still have an O(1) table! 


We can! And it's called the "Open Hash Table."
Let's see how it works! 
The "Open" Hash Table

Idea: Instead of storing our values directly in the array,
each array bucket points to a linked list of values.

Insert the following values: 1, 3, 11, 25, 101

To search for an item:
As before, compute a bucket
# with your mapping function:

bucket = mapFunc(idNum);

Search the linked list at
array[bucket] for your item.

If we reach the end of the
list without finding our item,
it's not in the table!

Cool! Since the linked
list in each bucket can
hold an unlimited
number of values...
Our open hash table is
not size-limited like
our closed one!

[Diagram: an array of buckets, each pointing to a linked list of values.]


The "Open" Hash Table: Deletions

Question:
How do you delete an item from an open hash table?

Answer:
You just remove the value from the linked list.

Let's delete the student with ID=11. We just
relink ID=1 to ID=101 and use the C++ delete
command to delete node 11.

Cool! Unlike a closed hash table, you can easily
delete items from an open hash table!

If you plan to repeatedly insert and
delete values into the hash table, then
the Open table is your best bet!

Also, you can insert more than N items
into your table (and still have great
performance)!

Why? Because each bucket can hold
more than one item!
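
Here is one way the open hash table could be sketched in C++. Two assumptions are made to keep it short: the table has 10 buckets, and std::list<int> stands in for the hand-written linked list nodes the slides describe (so there is no explicit delete call). The chainLength member exists only so the lists can be inspected.

```cpp
#include <algorithm>
#include <list>

const int NUM_BUCK = 10;

// Sketch of an open hash table: each bucket is a linked list of ID#s.
class OpenHashTable
{
public:
    void insert(int idNum)
    {
        std::list<int>& chain = m_buckets[mapFunc(idNum)];
        if (std::find(chain.begin(), chain.end(), idNum) == chain.end())
            chain.push_back(idNum);     // duplicates are ignored
    }

    bool search(int idNum) const
    {
        const std::list<int>& chain = m_buckets[mapFunc(idNum)];
        return std::find(chain.begin(), chain.end(), idNum) != chain.end();
    }

    // Deleting just unlinks the value from its bucket's list.
    void erase(int idNum)
    {
        m_buckets[mapFunc(idNum)].remove(idNum);
    }

    // Number of values chained in one bucket (for inspection only).
    int chainLength(int bucket) const
    {
        return static_cast<int>(m_buckets[bucket].size());
    }

private:
    int mapFunc(int idNum) const { return idNum % NUM_BUCK; }
    std::list<int> m_buckets[NUM_BUCK];
};
```

Inserting 1, 3, 11, 25 and 101 puts three values (1, 11, 101) in bucket 1. Erasing 11 leaves 1 and 101 in that list, and searches for them still succeed, unlike the naive deletion in the closed table.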


Hash Table Efficiency


Question: How efficient is the hash table ADT? 
How long does it take to locate an item? 
How long does it take to insert an item? 
Answer: 
It depends upon: 
(a) The type of hash table (e.g., closed vs. open), 
(b) how full your hash table is, and 


(c) how many collisions you have in the hash table. 


Hash Table Efficiency

Let's assume we have a completely
(or nearly) empty hash table...

What's the maximum number of steps
required to insert a new value?

Right! There's zero chance of
collision, so we can add our new value
in one step!

And finding an item in a nearly-empty
hash table is just as fast!

We have no collisions so either we
find an item right away or we know it's
not in the hash table...

[Diagram: a 10-bucket table (slots 0-9) in which every bucket is empty: idNum is -1 and the name and GPA fields are blank.]


Hash Table Efficiency

OK, but what if our hash table is nearly full?

What's the maximum number of steps required to
insert a new value?

Right! It could take up to N steps. Why? Well
let's say I want to insert a new item in slot #6.
Since the table is nearly full, we have to keep
probing down until we hit the first open slot. That
might be N slots away, in this case in slot 5.

And searching can take just as long
in the worst case...

So technically, a hash table can be up to O(N)
when it's nearly full!

So how big must we make our hash table so it runs
quickly? To figure this out, we first need to learn
about the "load" concept...

[Diagram: a 10-bucket table (slots 0-9) in which every bucket except slot 5 holds a student record (idNum, name, GPA).]


Hash Table Efficiency: The Load Factor

The "load" of a hash table is the
maximum number of values you intend to add
divided by
the number of buckets in the array.

L = (Max # of values to insert) / (Total buckets in the array)

Example: A load of L=.1 means your array has 10X more
buckets than you need (you'll only fill 10% of the buckets).

Example: A load of L=.9 means your array has about 11% more
buckets than you need (you'll fill 90% of the buckets).


Closed Hash w/Linear Probing Efficiency

Given a particular load L for a Closed Hash Table w/LP,
it's easy to compute the average # of tries it'll take
you to insert/find an item:

Average # of Tries = 1/2 * (1 + 1/(1-L)) for L < 1.0

So, if your closed hash table has a load factor of...    your search will take
.10 (your array is 10x bigger than required)             ~1.05 searches
.20 (your array is 5x bigger than required)              ~1.12 searches
.30 (your array is about 3.3x bigger than required)      ~1.21 searches

.70 (your array is about 43% bigger than required)       ~2.16 searches
.80 (your array is 25% bigger than required)             ~3.00 searches
.90 (your array is about 11% bigger than required)       ~5.50 searches
Open Hash Table Efficiency

Given a particular load L for an Open Hash Table, it's also easy
to compute the average # of tries to insert/find an item:

Average # of Checks = 1 + L/2

So, if your open hash table has a load factor of...      your search will take
.10 (your array is 10x bigger than required)             ~1.05 searches
.20 (your array is 5x bigger than required)              ~1.10 searches
.30 (your array is about 3.3x bigger than required)      ~1.15 searches

.70 (your array is about 43% bigger than required)       ~1.35 searches
.80 (your array is 25% bigger than required)             ~1.40 searches
.90 (your array is about 11% bigger than required)       ~1.45 searches
Closed vs. Open Hash Table

Load    Closed Hash w/L.P. Avg Steps    Open Hash Avg Steps
.10     ~1.05 searches                  ~1.05 searches
.20     ~1.12 searches                  ~1.10 searches
.30     ~1.21 searches                  ~1.15 searches

.70     ~2.16 searches                  ~1.35 searches
.80     ~3.00 searches                  ~1.40 searches
.90     ~5.50 searches                  ~1.45 searches

Moral: Open hash tables are almost ALWAYS more
efficient than Closed hash tables!
Sizing your Hash Table

Challenge:

If you want to store up to 1000 items in an Open Hash Table and be able to find
any item in roughly 1.25 searches, how many buckets must your hash table have?

Remember: Expected # of Checks = 1 + L/2

Answer:
Part 1: Set the equation above equal to 1.25 and solve for L:

1.25 = 1 + L/2  ->  .25 = L/2  ->  .5 = L

Part 2: Use the load formula to solve for "Required size":

L = (# of items to insert) / (Required hash table size)
.5 = 1000 / (Required hash table size)
Required hash table size = 1000 / .5 = 2000

If our hash table has 2000 buckets and we're inserting a maximum of 1000 values, we can expect
an average of about 1.25 steps per insert/search!


This result means: 
“If you want to be able to find/insert items into your open hash table in an average of 1.25 steps, you need 
a load of .5, or roughly 2x more buckets than the maximum number of values you'll put into your table.” 


So basically it's a tradeoff! 


You could always use a really big hash table with 
way-too-many buckets and ensure really fast searches... 


But then you'll end up wasting lots of memory... 


On the other hand, if you have a really small hash table 
(with just barely enough room), it'll be slower. 


Finally, when choosing the exact size of your hash table 
(the number of buckets)... 


Always try to choose a prime number of buckets... 


Instead of 2000 buckets, 
give your hash table 2017 buckets. 


This causes more even distribution and fewer collisions! 
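
The sizing steps above (solve for L, divide, then bump up to a prime) can be sketched as follows. The function names are made up for this example, and the prime test is plain trial division, which is fine for table sizes in this range.

```cpp
#include <cmath>

// Buckets needed so that an open hash table holding maxItems values
// averages targetChecks checks per lookup (targetChecks must be > 1).
// Rearranges checks = 1 + L/2 and L = maxItems / buckets.
int bucketsNeeded(int maxItems, double targetChecks)
{
    double load = 2.0 * (targetChecks - 1.0);
    return static_cast<int>(std::ceil(maxItems / load));
}

// Trial-division primality test.
bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

// Smallest prime that is >= n.
int nextPrimeAtLeast(int n)
{
    while (!isPrime(n))
        n++;
    return n;
}
```

For 1000 items and a target of 1.25 checks, bucketsNeeded gives 2000. Any prime at or above that fits the advice: nextPrimeAtLeast(2000) is 2003, and the slide's choice of 2017 is another prime just above 2000.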
What Happens If...

What happens if we want to allow
the user to search by the student's
name instead of their ID number?

Well, our original mapping
function won't quite work:

int mapFunc(int ID)
{
    return ID % 100000;
}

Well, we need a two-step process!

First, we need to compute a numeric value
from our string using a "hash" function!

Second, we use our modulo as before to compute a
bucket number that fits into our hash table.

int mapFunc(string &name)
{
    int h = hash(name);
    return h % 100000;
}

A hash function is a function that
takes an arbitrary input (like a string)...

And produces an integer output, like a
value between 0 and 2 billion.
A Hash Function for Strings

Here's one possibility for a hash function that can convert a string into
an integer value.

int hash(string &name)
{
    int i, total = 0;

    for (i = 0; i < name.length(); i++)
        total = total + name[i];

    return total;
}

Hint: What happens if we hash "BAT"? What happens if we hash "TAB"?

How can we fix it?

Answer: This hash function produces the same output value for many different
inputs. For example, hash("BAT") == hash("TAB"). Ideally, we want a hash
function that gives very different results even for similar inputs.
A Better Hash Function for Strings

Here's a better version of our string hashing function - while not perfect, it
disperses items more uniformly in the table. Notice that it takes the position
of each character into account when computing its result.

int hash(string &name)
{
    int i, total = 0;

    for (i = 0; i < name.length(); i++)
        total = total + (i + 1) * name[i];

    return total;
}

Now "BAT" and "TAB" hash to different slots in our
array since this version takes character position into account.

But this function still isn't great. Coming up with good hash functions is a
PhD-level exercise!
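
The two versions can be compared directly. In this sketch they are renamed sumHash and positionalHash (names chosen here so both can live in one file), take the string by const reference, and assume ASCII character codes.

```cpp
#include <string>

// The first version from the slides: add up the character codes.
int sumHash(const std::string& name)
{
    int total = 0;
    for (std::string::size_type i = 0; i < name.length(); i++)
        total = total + name[i];
    return total;
}

// The improved version: weight each character by its 1-based position.
int positionalHash(const std::string& name)
{
    int total = 0;
    for (std::string::size_type i = 0; i < name.length(); i++)
        total = total + static_cast<int>(i + 1) * name[i];
    return total;
}
```

With ASCII codes B=66, A=65 and T=84, sumHash gives 215 for both "BAT" and "TAB", while positionalHash gives 448 for "BAT" (66 + 2*65 + 3*84) and 412 for "TAB" (84 + 2*65 + 3*66).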
A GREAT Hash Function for Strings

Rather than write your own hash function from scratch, why not use one written by the pros?

C++ provides a great string hashing function:

Make sure to
#include <functional>
to use C++'s hash function!

We'll define our own mapping function, but
leverage C++'s hash algorithm under the hood.

#include <functional>
#include <string>

unsigned int yourMapFunction(const std::string &hashMe)
{
    std::hash<std::string> str_hash;            // creates a string hasher!
    unsigned int hashValue = str_hash(hashMe);  // now hash our string!

    // then just add your own modulo
    unsigned int bucketNum = hashValue % NUM_BUCKETS;
    return bucketNum;
}

First you define a
C++ string hashing
object.

Then you use the object to hash
your input string. It returns a hash
value between 0 and 4 billion.

Finally, you apply your own modulo
function and return a bucket # that
fits into your hash table's array.
Writing Your Own Hash Function


Great! But what if you need to write a hash function for
some non-standard data type?


unsigned int yourMapFunction(const SomeCrazyTypeOfData &hashMe)
{
    // your hashing logic goes here
}


Like hashing...


Geospatial coordinates
An array of N numbers
The contents of a data file


This is a non-trivial exercise!


You really need to understand the “nature” of the data
you're hashing...


Then design your algorithm, analyze it, and iterate. 
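As one possible starting point for the "geospatial coordinates" case, here is a rough sketch. The GeoCoord struct, the extra numBuckets parameter, and the multiplier 31 are all assumptions invented for this example. It simply reuses C++'s std::hash<double> on each field and mixes the results. It has not been analyzed for how well it disperses real coordinate data, so treat it as a first draft to measure and iterate on, not a finished design.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical data type, invented for this illustration.
struct GeoCoord
{
    double latitude;
    double longitude;
};

// Sketch of a mapping function for GeoCoord. numBuckets is passed in as a
// parameter here (an assumption) instead of using a global constant.
unsigned int yourMapFunction(const GeoCoord &hashMe, unsigned int numBuckets)
{
    assert(numBuckets > 0);    // modulo by zero is undefined

    std::hash<double> dbl_hash;                         // C++'s hasher for doubles
    std::size_t latHash = dbl_hash(hashMe.latitude);
    std::size_t lonHash = dbl_hash(hashMe.longitude);

    // Mix the two values asymmetrically so that swapping latitude and
    // longitude usually changes the result. 31 is an arbitrary choice.
    std::size_t combined = latHash * 31 + lonHash;

    return static_cast<unsigned int>(combined % numBuckets);
}
```

The same coordinate always maps to the same bucket (tip #1 below), and the result always fits in the range 0 to numBuckets-1 because of the final modulo.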


Choosing a Hash Function: Tips 


1. The hash function must always give us the same output 
value for a given input value: 


Today: hash(400683948) → 83,948
Tomorrow: hash(400683948) → still 83,948


2. The hash function should disperse items throughout the
hash array as randomly as possible.


hash("abc") = 294
hash("cba") = 294

Not good!


3. When coming up with a new hash function, always 
measure how well it disperses items (do some experiments!) 
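Here is one way such an experiment might look. It is only a sketch: the function names (sumHash, positionHash, fullestBucket) are made up, and "size of the fullest bucket" is just one simple measure of dispersion among many you could pick.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Candidate #1: add up the character codes (ignores character order).
unsigned int sumHash(const std::string &s)
{
    unsigned int total = 0;
    for (std::size_t i = 0; i < s.length(); i++)
        total += static_cast<unsigned char>(s[i]);
    return total;
}

// Candidate #2: weight each character by its position (1, 2, 3, ...).
unsigned int positionHash(const std::string &s)
{
    unsigned int total = 0;
    for (std::size_t i = 0; i < s.length(); i++)
        total += static_cast<unsigned int>(i + 1) * static_cast<unsigned char>(s[i]);
    return total;
}

// Hash every key into one of numBuckets buckets and report how many keys
// landed in the fullest bucket. Smaller is better: fewer collisions.
int fullestBucket(const std::vector<std::string> &keys,
                  unsigned int (*hashFn)(const std::string &),
                  unsigned int numBuckets)
{
    if (numBuckets == 0)
        return 0;                                   // nothing to measure

    std::vector<int> counts(numBuckets, 0);
    for (const std::string &key : keys)
        counts[hashFn(key) % numBuckets]++;         // same modulo step as a real table

    return *std::max_element(counts.begin(), counts.end());
}
```

For the six orderings of the letters a, b, c ("abc", "acb", "bac", "bca", "cab", "cba") and 10 buckets, sumHash sends all six keys to the same bucket (every one hashes to 294). positionHash does better, but its fullest bucket still holds two keys, which is a good reminder to measure before trusting a hash function.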
The unordered_map: A hash-based version of a map


#include <unordered_map>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    unordered_map<string,int> hm;                   // define a new U_M
    unordered_map<string,int>::iterator iter;       // define an iterator for a U_M

    hm["Carey"] = 10;                               // insert a new item into the U_M
    hm["David"] = 20;

    iter = hm.find("Carey");                        // find Carey in the hash map
    if (iter == hm.end())                           // did we find Carey or not?
        cout << "Carey was not found!";             // couldn't find "Carey" in the hash map
    else
    {
        cout << "When we look up " << iter->first;  // "When we look up Carey"
        cout << " we find " << iter->second;        // "we find 10"
    }
}
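Since a hash table's speed depends on how full it is, it can be useful to peek at an unordered_map's load. The sketch below relies on the standard member functions size(), bucket_count(), load_factor(), and max_load_factor(). The helper names itemsPerBucket and withinMaxLoad are made up for this example.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: computes (# of items) / (# of buckets) by hand.
// This is the same ratio that the container's own load_factor() reports.
double itemsPerBucket(const std::unordered_map<std::string, int> &hm)
{
    if (hm.bucket_count() == 0)     // some implementations start out with no buckets
        return 0.0;
    return static_cast<double>(hm.size()) / hm.bucket_count();
}

// Hypothetical helper: is the table still within its allowed load?
bool withinMaxLoad(const std::unordered_map<std::string, int> &hm)
{
    return hm.load_factor() <= hm.max_load_factor();
}
```

The unordered_map adds buckets on its own as items are inserted, so you normally don't have to manage the load yourself. The exact number of buckets at any moment depends on your compiler's library.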


Hash Tables vs. Binary Search Trees


                    Hash Tables                        Binary Search Trees

Speed               O(1) regardless of # of items      O(log N)

Simplicity          Easy to implement                  More complex to implement

Max Size            Closed: Limited by array size      No fixed size limit
                    Open: No fixed limit, but high
                    load impacts performance

Space Efficiency    Wastes a lot of space if you       Only uses as much memory as
                    have a large hash table            needed (one node per item
                    holding few items                  inserted)

Ordering            No ordering (random)               Alphabetical ordering

Tables




Tables 
Why should you care? 


Tables are the building block of 
databases (like Oracle & MySQL) 


They're used to organize large amounts 
of data and make it quickly searchable. 


Tables are used to: 
Hold your $$ bank account data 
Store your student transcripts 

Hold your credit card transactions 

Hold usernames/pws for most sites 


So pay attention! 
"Tables"


Let's say you want to
write a program to keep track
of all your BFFs...


Of course, you want to
remember all the important
dirt about each BFF:


Name: Carey Nash
Phone number: 867-5309
Birthday: July 28
iPhone or 'droid: iPhone
Social Security #: 111222333
Favorite food: ...


And you want to quickly be
able to search for a BFF in
one or more ways...


"Find all the dirt on my BFF
'David Johansen'"


"Find all the dirt on the BFF
whose number is 867-5309"
"Tables"


In CS lingo, a group of related data 
is called a “record.” 


Each record has a bunch of “fields” 
like Name, Phone #, Birthday, etc. 
that can be filled in with values. 


If we have a bunch of records, 
we call this a “table.” Simple! 


While you may have many records with 
the same Name field value (e.g., John 
Smith) or the same Birthday field 
value (e.g., Jan 15th)...


Some fields, like Social Security 
Number, will have unique values across all 
records - this type of field is useful for 

searching and finding a unique record! 


Name: Carey Nash 
Phone number: 867-5309 
Birthday: July 28 


iPhone or ‘droid: iPhone 
Social Security #: 111222333
Favorite food: ... 


David Small 


Name: John Rohr 
Phone number: 999-9191 
Birthday: Jan 1 
iPhone or ‘droid: Droid 
Social Security #: 47372727 
Favorite food: ... 


A field (like the SSN) that
has unique values across all
records is called a
“key field.”
Implementing Tables


How could you create a record in C++?
Answer: Just use a struct or class
to represent a record of data!


struct Student
{
    string name;
    int IDNum;
    float GPA;
    string phone;
};


How can you create a table in C++?


Answer: You can simply create an
array or vector of your struct!


vector<Student> table;


// algorithm to search by the name field
int SearchByName(const vector<Student> &table, const string &findName)
{
    for (int s = 0; s < static_cast<int>(table.size()); s++)
        if (findName == table[s].name)
            return s;       // the student you're looking for is in slot s
    return -1;              // didn't find that student in your table
}


// algorithm to search by the phone field
int SearchByPhone(const vector<Student> &table, const string &findPhone)
{
    for (int s = 0; s < static_cast<int>(table.size()); s++)
        if (findPhone == table[s].phone)
            return s;       // the student you're looking for is in slot s
    return -1;              // didn't find that student in your table
}
Implementing Tables


Heck, why not just create a
whole C++ class for our table?


struct Student
{
    string name;
    int IDNum;
    float GPA;
    string phone;
};


class TableOfStudents
{
public:
    TableOfStudents();                  // construct a new table
    ~TableOfStudents();                 // destruct our table
    void addStudent(Student &stud);     // add a new Student
    Student getStudent(int s);          // retrieve Student from slot s
    int searchByName(string &name);     // name is a searchable field
    int searchByPhone(int phone);       // phone is a searchable field

private:
    vector<Student> m_students;
};
Tables


In the TableOfStudents class, we used a vector to hold our table
and a linear search to find Students by their name or phone.


Name: David
ID #: 111222333
GPA: 2.1
Phone: 310 825-1234


Name: John
ID #: 95847362
GPA: 3.8
Phone: 818 416-0355


Name: Carey
ID #: 400683945
GPA: 4.0
Phone: 424 750-7519


Name: Albert
ID #: 012191928
GPA: 1.5
Phone: 626 599-5939


This is a perfectly valid table - but it's slow to find
a student! How can we make it more efficient?


Well, we could alphabetically sort our
vector of records by their names...


Then we could use a binary search to
quickly locate a record based on a
person's name.


But then every time we add a new
record, we have to re-sort the whole
table. Yuck!


And if we sort by name, we can't
search efficiently by other fields like
phone # or ID #!
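The sort-then-binary-search idea can be sketched as follows. This is only an illustration: the function names sortByName and binarySearchByName are made up, and the sketch leans on the standard library's std::sort and std::lower_bound instead of a hand-written binary search.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Student
{
    std::string name;
    int IDNum;
    float GPA;
    std::string phone;
};

// Sort the table alphabetically by name so that it can be binary searched.
// This must be redone (or the new record inserted in order) after every add.
void sortByName(std::vector<Student> &table)
{
    std::sort(table.begin(), table.end(),
              [](const Student &a, const Student &b) { return a.name < b.name; });
}

// Binary search a table that is ALREADY sorted by name.
// Returns the slot number, or -1 if the name isn't in the table.
int binarySearchByName(const std::vector<Student> &table, const std::string &findName)
{
    // lower_bound finds the first record whose name is not less than findName
    auto it = std::lower_bound(table.begin(), table.end(), findName,
                               [](const Student &s, const std::string &name)
                               { return s.name < name; });

    if (it != table.end() && it->name == findName)
        return static_cast<int>(it - table.begin());
    return -1;
}
```

Notice that the search only works on the name field. Searching by phone # or ID # would still need a linear scan, because the vector can only be sorted one way at a time.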
Tables


Hmmm... What if we stored our records in a binary search tree
(e.g., a map) organized by name? Would that fix things?


Name: David
ID #: 111222333
GPA: 2.1
Phone: 310 825-1234


Name: Albert                 Name: John
ID #: 012191928              ID #: 95847362
GPA: 1.5                     GPA: 3.8
Phone: 626 599-5939          Phone: 818 416-0355


Name: Carey 
ID #: 400683945 
GPA: 4.0 
Phone: 424 750-7519 


Well, now we can search the table efficiently by name... 


But we still can't search efficiently by ID# or Phone #.... 
Tables


Hmmm... What if we create two tables,
ordering the first by name and the second by ID#?


First tree, ordered by name:


Name: David
ID #: 111222333
GPA: 2.1
Phone: 310 825-1234


Name: Albert                 Name: John
ID #: 012191928              ID #: 95847362
GPA: 1.5                     GPA: 3.8
Phone: 626 599-5939          Phone: 818 416-0355


Name: Carey
ID #: 400683945
GPA: 4.0
Phone: 424 750-7519


Second tree, ordered by ID#:


Name: David
ID #: 111222333
GPA: 2.1
Phone: 310 825-1234


Name: Albert                 Name: Carey
ID #: 012191928              ID #: 400683945
GPA: 1.5                     GPA: 4.0
Phone: 626 599-5939          Phone: 424 750-7519


Name: John
ID #: 95847362
GPA: 3.8
Phone: 818 416-0355


That works... Now I can quickly find people by name or ID#! 


But now we have two copies of every record, one in each tree! 
If the records are big, that's a waste of space! 


So what can we do? Let's see! 


Making an Efficient Table


1. We'll still use a vector to store all of our records...


2. Let's also add a data structure that lets us associate
each person's name with their slot # in the vector...


3. And we can add another data structure to associate
each person's ID # with their slot # too!


class TableOfStudents
{
public:
    TableOfStudents();
    ~TableOfStudents();
    void addStudent(Student &stud);
    Student getStudent(int s);
    int searchByName(string &name);
    int searchByPhone(int phone);

private:
    vector<Student> m_students;
    map<string,int> m_nameToSlot;
    map<int,int> m_idToSlot;
    map<int,int> m_phoneToSlot;
};


m_students

name: Alex      GPA: 2.05    ID: 7124
name: Linda     GPA: 3.99    ID: 0003
name: Jason     GPA: 1.55    ID: 1054


Our second data structure lets us
quickly look up a name
and find out which slot
in the vector holds the
related record.


Our third data structure
lets us quickly look up an
ID# and find out which
slot in the vector holds
the related record.


These secondary data structures are
called “indexes.”


Each index lets us efficiently find a
record based on a particular field.


We may have as many indexes
as we need for our application.


Making an Efficient Table


So what does our addStudent
method look like now?


class TableOfStudents
{
public:
    // (other methods as before)

    void addStudent(Student &stud)
    {
        m_students.push_back(stud);
        int slot = m_students.size() - 1;       // slot # of the record we just added
        m_nameToSlot[stud.name] = slot;         // maps name to slot #
        m_idToSlot[stud.IDNum] = slot;          // maps ID# to slot #
    }

private:
    vector<Student> m_students;
    map<string,int> m_nameToSlot;
    map<int,int> m_idToSlot;
};


Well, we have to add our new student record to
our vector just like before.


But now, every time we add
a record, we've also got to
add the name to slot #
mapping to our first map!


Finally, every time we add a
record, we've also got to add the
ID# to slot # mapping to our
second map!
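Putting the pieces together, a searchable version of the class might look roughly like this. It is a sketch under a few assumptions: names and ID#s are treated as unique (adding a duplicate would overwrite the earlier index entry), and the method searchByID plus the "return -1 when not found" convention are choices made for this example, not something the slides spell out.

```cpp
#include <map>
#include <string>
#include <vector>

struct Student
{
    std::string name;
    int IDNum;
    float GPA;
    std::string phone;
};

class TableOfStudents
{
public:
    // Store the record in the vector, then record its slot # in both indexes.
    void addStudent(const Student &stud)
    {
        m_students.push_back(stud);
        int slot = static_cast<int>(m_students.size()) - 1;
        m_nameToSlot[stud.name] = slot;     // maps name to slot #
        m_idToSlot[stud.IDNum] = slot;      // maps ID# to slot #
    }

    // Retrieve the Student stored in slot s (throws if s is out of range).
    Student getStudent(int s) const
    {
        return m_students.at(s);
    }

    // Look up a slot # through the name index; -1 means "not found".
    int searchByName(const std::string &name) const
    {
        auto it = m_nameToSlot.find(name);
        return (it == m_nameToSlot.end()) ? -1 : it->second;
    }

    // Look up a slot # through the ID# index; -1 means "not found".
    int searchByID(int id) const
    {
        auto it = m_idToSlot.find(id);
        return (it == m_idToSlot.end()) ? -1 : it->second;
    }

private:
    std::vector<Student> m_students;            // the records themselves
    std::map<std::string, int> m_nameToSlot;    // index #1: name -> slot #
    std::map<int, int> m_idToSlot;              // index #2: ID#  -> slot #
};
```

Each record is stored exactly once, in the vector. The two maps hold only a key and a slot #, so searching by either field is fast without keeping a second copy of every record.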


Complex Tables


So to review, what do we have to do to insert a new record
into our table?


Let's add: Wendy, ID=1000, GPA=3.9


[Diagram: the name index and the ID# index, before and after adding Wendy]


m_students

name: Alex      GPA: 2.05    ID: 7124
name: Linda     GPA: 3.99    ID: 0003
name: Jason     GPA: 1.55    ID: 1054
name: Abe       GPA: 4.00    ID: 9876
name: Zelda     GPA: 3.43    ID: 6416
name: Carey     GPA: 3.62    ID: 4006
name: Wendy     GPA: 3.9     ID: 1000

But wait!!!! - Any time you delete a record or update a record's searchable 
fields, you also have to update your indexes! 
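To illustrate the "update" half of that warning, here is a sketch of what changing a student's name could involve. The renameStudent method is hypothetical (it is not part of the class shown in the slides), the Student struct is trimmed to two fields to keep the example short, and names are assumed to be unique.

```cpp
#include <map>
#include <string>
#include <vector>

struct Student
{
    std::string name;
    int IDNum;
};

class TableOfStudents
{
public:
    void addStudent(const Student &stud)
    {
        m_students.push_back(stud);
        int slot = static_cast<int>(m_students.size()) - 1;
        m_nameToSlot[stud.name] = slot;
        m_idToSlot[stud.IDNum] = slot;
    }

    int searchByName(const std::string &name) const
    {
        auto it = m_nameToSlot.find(name);
        return (it == m_nameToSlot.end()) ? -1 : it->second;
    }

    // Hypothetical method: change a student's name. Because name is an
    // indexed field, the name index has to change along with the record.
    bool renameStudent(const std::string &oldName, const std::string &newName)
    {
        auto it = m_nameToSlot.find(oldName);
        if (it == m_nameToSlot.end() || m_nameToSlot.count(newName) > 0)
            return false;                   // no such student, or new name already in use

        int slot = it->second;
        m_nameToSlot.erase(it);             // 1. remove the stale index entry
        m_students[slot].name = newName;    // 2. update the record itself
        m_nameToSlot[newName] = slot;       // 3. index the record under its new name
        return true;                        // ID# index is untouched: IDNum didn't change
    }

private:
    std::vector<Student> m_students;
    std::map<std::string, int> m_nameToSlot;
    std::map<int, int> m_idToSlot;
};
```

If step 1 were skipped, a search for the old name would still "succeed" and point at a record that no longer has that name. Deleting a record takes similar care, since removing an item from the middle of the vector changes the slot # of every record after it.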
Tables


As it turns out, databases like “Oracle” use exactly
this approach to store and index data!


(The only difference is they usually store their
data on disk rather than in memory)


And by the way... While my example used
binary search trees to index our table's fields...


You could use any efficient data structure you like! 


For example, you could use a hash table! 
Using Hashing to Speed Up Tables


m_students

name: Alex      GPA: 2.05    ID: 7124
name: Linda     GPA: 3.99    ID: 0003
name: Jason     GPA: 1.55    ID: 1054
name: Abe       GPA: 4.00    ID: 9876


Can we use hash tables to index our data instead of binary search trees?
Of course!


Now we can have O(1) searches by name! Cool! But in that case why
not just always use hash tables to index all of our key fields?


Answer: Because hash tables store the data in an essentially
random order.


While a BST is slower, it does order the key fields in alphabetical order...


For instance, what if we want to be able to print out all
students alphabetically by their name?


If our index data structure is a binary search tree, that's easy!
If we indexed with a hash table, we'd have to do a lot more
work to do the same thing...
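Here is a sketch of what that extra work might look like. Both functions are named namesInOrder (a made-up name) and return the indexed names in alphabetical order: one takes a BST-based std::map index, the other a hash-based std::unordered_map index.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// BST-based index: iterating a std::map already visits keys in sorted order.
std::vector<std::string> namesInOrder(const std::map<std::string, int> &index)
{
    std::vector<std::string> names;
    for (const auto &entry : index)
        names.push_back(entry.first);
    return names;
}

// Hash-based index: iteration order is unspecified, so we have to copy
// the keys out and sort them ourselves (an extra O(N log N) step).
std::vector<std::string> namesInOrder(const std::unordered_map<std::string, int> &index)
{
    std::vector<std::string> names;
    for (const auto &entry : index)
        names.push_back(entry.first);
    std::sort(names.begin(), names.end());
    return names;
}
```

Both versions produce the same list, but the hash-based one pays for a full sort every time you want ordered output, while the map gets the ordering for free as part of how it stores its keys.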


[Diagram: a binary search tree index that maps each ID# to its slot # in m_students, e.g., ID# 7124 → slot 0, ID# 0003 → slot 1, ID# 1054 → slot 2]


Moral: You need to understand how your
table will be used to determine how to
best index each field.


For example:


I'd use a BST for the name field so I can
print people's names in alphabetical order.
But I'd use a hash table for the phone field,
'cause I just need to search quickly but I
don't need to order records by their phone
number.
Challenges


Question: What is the big-oh of traversing all of the elements in a hash
table?


Question: I have two hash tables: the first has 10 buckets, and the second
has 20 buckets. If I insert each of the following IDs into each hash table,
where will each ID number end up (which bucket #s)?


ID = 5
ID = 15
ID = 25
ID = 100


Question: How can you print out the items in a hash table in
alphabetical/numerical order?